Upstream methods

Long-reads sequencing (ONT)

For the ONT samples, (Unmodified) raw sequences were extracted from fast5 files, trimmed using chopper to remove low quality bases from the reads. Alignments were carried out using minimap2 against the mitochondrial chromosome (rCRS) and the chr22 autosome. Due to the circular nature of the mitochondrial chromosome, reads were aligned in parallel to an alternative version of the rCRS whose start-end positions were shifted by 8,000 bp. This allowed long reads spanning over the arbitrary start-end regions in the rCRS to be properly aligned. Only primary alignments were considered (SAM flags 0 and 16).

SNV calling was carried out using GATK Mutect2 in mitochondria mode for each of the three alignments (rCRS, shifted rCRS and chr22). The variants in the VCF resulting from the alignments of long reads against the 8,000 bp-shifted rCRS were lifted over to the original rCRS coordinates. A final set of mtDNA variants was obtained by combining variants within positions 4,000-11,999 from the standard rCRS reference, and within 1-3,999 and 12,000-16,569 from the alignment against the 8,000 bp-shifted reference. Haplocheck was used to test for haplotype contamination in the resulting filtered mtDNA variants.

To use as a nuclear control, we used the same methodology to call SNVs in a short region from the chr22 (chr22:20,000,000-20,100,000).

To obtain deletions at the single-read level from the ONT data, we parsed the sam CIGAR strings in the mitomap2 alignments using a custom script. Similarly, we parsed CIGAR strings to obtain the number of matching (=) and mismatching (X) positions in each single-read alignment.

See PIPELINE_short.sh for the commands used. Scripts and executables are located in the bin/ folder.

Illumina samples

The two EDTA blood samples sequenced using Illumina NGS technology were aligned against the whole genome using the GATK recommended guidelines, following the mitochondrial-specific pipeline for variant calling.

Sample information

Metadata with sample information:

Metadata <- read_csv("./Tables/Metadata.txt", col_types = "cccc") %>%
    mutate(tissue = ifelse(str_starts(tissue, "edta_blood"), "edta_blood", tissue)) %>%
    mutate(platform = seq) %>%
    select(-seq)
Metadata %>%
    arrange(platform, tissue) %>%
    kable(caption = "Sample overview") %>% 
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Sample overview
sample_id tissue individual_id platform
SL443862 edta_blood c illumina
SL443863 edta_blood d illumina
6226-CT-0010 PFC a ont
6226-CT-0018 cerebellum b ont
6226-CT-0015 colon b ont
6226-CT-0016 edta_blood c ont
6226-CT-0017 edta_blood d ont
6226-CT-0011 heart b ont
6226-CT-0014 kidney b ont
6226-CT-0013 liver b ont
6226-CT-0012 muscle b ont

NOTE: sample from PFC is from a different individual (a), and EDTA blood samples (both Illumina and ONT sequencing samples) are yet from different individuals (c and d).

General sequencing QC overview

mtDNA sequencing coverage

The following plot shows the rCRS coverage for all ONT samples (including those belonging to different individuals and tissues). Different colours correspond to different individuals.

Depth of coverage is calculated using samtools depth on both mtDNA alignments (rCRS and shifted rCRS), but positions are subsetted to 4,000-12,000 for the rCRS and 4,000-12,569 for the shifted rCRS, which correspond to chrM:1-4000 and chrM:12,000-16,569 when shifted back).

Depth of coverage for chr22 is restricted to chr22:20,000,000-20,100,000 for simplicity.

depthFiles1 <- paste0("../results/depth/", 
                      filter(Metadata, platform == "ont")$sample_id,
                      "_chrM.txt")
names(depthFiles1) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles1))

depthFiles2 <- paste0("../results/depth/", 
                      filter(Metadata, platform == "ont")$sample_id, 
                      "_chrM_shifted.txt")
names(depthFiles2) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles2))

depthFiles22 <- paste0("../results/depth/", 
                       filter(Metadata, platform == "ont")$sample_id,
                       "_chr22.txt")
names(depthFiles22) <- filter(Metadata, platform == "ont")$sample_id
#all(file.exists(depthFiles22))

Cov <- bind_rows(
    lapply(names(depthFiles1), function(sid) {
        read_tsv(depthFiles1[sid], col_names = FALSE, col_types = "cii") %>%
            mutate(sample_id = sid) %>%
            select(Pos = X2, Cov = X3, sample_id) %>%
            filter(Pos >= 4000, Pos < 12000)
    }) %>% Reduce("bind_rows", .),
    lapply(names(depthFiles2), function(sid) {
        read_tsv(depthFiles1[sid], col_names = FALSE, col_types = "cii") %>%
            mutate(sample_id = sid) %>%
            select(Pos = X2, Cov = X3, sample_id) %>%
            filter(Pos >= 4000, Pos < 12569) %>%
            mutate(Pos = sapply(Pos, us8k))
    }) %>% Reduce("bind_rows", .)
) %>% filter(Pos != 3107) %>%
    arrange(sample_id, Pos)

plotLabels <- left_join(Cov, select(Metadata, sample_id, tissue, individual_id), by = join_by(sample_id)) %>%
    group_by(tissue) %>%
    filter(Pos == 16569) %>%
    ungroup()

Plot1 <- left_join(Cov, Metadata, by = join_by(sample_id)) %>%
    ggplot() +
    geom_smooth(aes(x = Pos, y = Cov, group = sample_id, colour = individual_id), 
        method = "gam", formula = y ~ s(x, bs = "cs", fx = TRUE, k = 100), se = FALSE) +
    geom_text_repel(data = plotLabels, aes(x = Pos, y = Cov, label = tissue),
        nudge_x = 1000, direction = "y", 
        box.padding = 0.5, point.padding = 0.5, 
        segment.color = "grey50", size = 5) +
    scale_y_log10() +
    theme_cowplot() +
    theme(legend.position = "none") +
    ggtitle("Absolute coverage rCRS")
Plot1

# invisible(svg("./Figs/coverage_ONT_all.svg", height = 6, width = 12))
# Plot1
# invisible(dev.off())
invisible(png("./Figs/coverage_ONT_all.png", height = 1000, width = 2000, res = 150))
Plot1
invisible(dev.off())

The high levels of coverage seen in the the mitochondrial chromosome contrast with the more sparse nuclear coverage. Based on the 100,000bp region from chr22, the nuclear genome has a coverage of 1-3 orders of magnitude lower.

In addition, the nuclear coverage is much more similar across different tissues than the mitochondrial, which exhibits very large differences (e.g., ~400X in EDTA blood vs ~200,000X in heart tissue).

The following table displays summary statistics for both the rCRS coverage (top) and the 100,000bp region in chr22 (bottom).

Cov22 <- lapply(names(depthFiles22), function(sid) {
    read_tsv(depthFiles22[sid], col_names = FALSE, col_types = "cii") %>%
        mutate(sample_id = sid) %>%
        select(Pos = X2, Cov = X3, sample_id)
}) %>% Reduce("bind_rows", .)

summTable <- left_join(Cov, Metadata, by = join_by(sample_id)) %>%
    group_by(sample_id) %>%
    summarise(mean_cov = round(mean(Cov)),
              stdev_cov = round(sd(Cov))) %>%
    ungroup() %>%
    left_join(Metadata, by = join_by(sample_id)) %>%
    select(tissue, individual_id, sample_id, mean_cov, stdev_cov)

summTable22 <- left_join(Cov22, Metadata, by = join_by(sample_id)) %>%
    group_by(sample_id) %>%
    summarise(mean_cov = round(mean(Cov)),
              stdev_cov = round(sd(Cov))) %>%
    ungroup() %>%
    left_join(Metadata, by = join_by(sample_id)) %>%
    select(tissue, individual_id, sample_id, mean_cov, stdev_cov)

summTable %>%
    kable(caption = "Summary statistics rCRS coverage per sample") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Summary statistics rCRS coverage per sample
tissue individual_id sample_id mean_cov stdev_cov
PFC a 6226-CT-0010 7456 407
heart b 6226-CT-0011 199966 9417
muscle b 6226-CT-0012 17489 723
liver b 6226-CT-0013 18876 1559
kidney b 6226-CT-0014 12790 574
colon b 6226-CT-0015 3321 135
edta_blood c 6226-CT-0016 436 44
edta_blood d 6226-CT-0017 452 55
cerebellum b 6226-CT-0018 3705 178
summTable22 %>%
    kable(caption = "Summary statistics chr22:20,000,000-20,100,000 coverage per sample") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Summary statistics chr22:20,000,000-20,100,000 coverage per sample
tissue individual_id sample_id mean_cov stdev_cov
PFC a 6226-CT-0010 63 105
heart b 6226-CT-0011 111 201
muscle b 6226-CT-0012 81 155
liver b 6226-CT-0013 60 105
kidney b 6226-CT-0014 114 208
colon b 6226-CT-0015 42 76
edta_blood c 6226-CT-0016 39 70
edta_blood d 6226-CT-0017 40 72
cerebellum b 6226-CT-0018 87 163

This data can be used to estimate the a relative copy number across tissues by normalising mtDNA:nuclear coverage. This represents only a relative copy number estimate of the average in the tissue (i.e., different cell types within a tissue have different mtDNA copy numbers).

summTable_ont <- summTable %>% select(tissue, individual_id, sample_id, mean_cov_MT = mean_cov) %>%
    full_join(summTable22 %>% select(tissue, individual_id, sample_id, mean_cov_chr22 = mean_cov),
              by = join_by(tissue, individual_id, sample_id)) %>%
    mutate(relative_copynumber = round(mean_cov_MT/mean_cov_chr22))
summTable_ont %>% 
    kable(caption = "Relative mtDNA copy number based on avg. depth of coverage") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Relative mtDNA copy number based on avg. depth of coverage
tissue individual_id sample_id mean_cov_MT mean_cov_chr22 relative_copynumber
PFC a 6226-CT-0010 7456 63 118
heart b 6226-CT-0011 199966 111 1801
muscle b 6226-CT-0012 17489 81 216
liver b 6226-CT-0013 18876 60 315
kidney b 6226-CT-0014 12790 114 112
colon b 6226-CT-0015 3321 42 79
edta_blood c 6226-CT-0016 436 39 11
edta_blood d 6226-CT-0017 452 40 11
cerebellum b 6226-CT-0018 3705 87 43

Is the mtDNA copy number estimation similar in Illumina-based sequencing? We can use the samples for which we have both ONT and Illumina to investigate this.

depthFiles_wgs_1 <- paste0("../results/depth/", 
                           filter(Metadata, platform == "illumina")$sample_id, 
                           "_chrM.txt")
names(depthFiles_wgs_1) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_1))

depthFiles_wgs_2 <- paste0("../results/depth/", 
                           filter(Metadata, platform == "illumina")$sample_id, 
                           "_chrM_shifted.txt")
names(depthFiles_wgs_2) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_2))

depthFiles_wgs_22 <- paste0("../results/depth/", 
                            filter(Metadata, platform == "illumina")$sample_id, 
                            "_chr22.txt")
names(depthFiles_wgs_22) <- filter(Metadata, platform == "illumina")$sample_id
#all(file.exists(depthFiles_wgs_22))


## COVERAGE / SUMMARY STATS
Cov_wgs <- bind_rows(
    Cov_wgs_1 <- lapply(names(depthFiles_wgs_1), function(sid) {
        read_tsv(depthFiles_wgs_1[sid], col_names = FALSE, col_types = "cii") %>%
            mutate(sample_id = sid) %>%
            select(Pos = X2, Cov = X3, sample_id) %>%
            filter(Pos >= 4000, Pos < 12000)
    }) %>% Reduce("bind_rows", .),
    Cov_wgs_2 <- lapply(names(depthFiles_wgs_2), function(sid) {
        read_tsv(depthFiles_wgs_2[sid], col_names = FALSE, col_types = "cii") %>%
            mutate(sample_id = sid) %>%
            select(Pos = X2, Cov = X3, sample_id) %>%
            filter(Pos >= 4000, Pos < 12569) %>%
            mutate(Pos = sapply(Pos, us8k))
    })
) %>% filter(Pos != 3107) %>%
    arrange(sample_id, Pos)

Cov_wgs_22 <- lapply(names(depthFiles_wgs_22), function(sid) {
    read_tsv(depthFiles_wgs_22[sid], col_names = FALSE, col_types = "cii") %>%
        mutate(sample_id = sid) %>%
        select(Pos = X2, Cov = X3, sample_id)
}) %>% Reduce("bind_rows", .)

summTable_wgs <- left_join(Cov_wgs, Metadata, by = join_by(sample_id)) %>%
    group_by(sample_id) %>%
    summarise(mean_cov = round(mean(Cov)),
              stdev_cov = round(sd(Cov))) %>%
    ungroup() %>%
    left_join(Metadata, by = join_by(sample_id)) %>%
    select(tissue, individual_id, sample_id, mean_cov, stdev_cov)

summTable_wgs_22 <- left_join(Cov_wgs_22, Metadata, by = join_by(sample_id)) %>%
    group_by(sample_id) %>%
    summarise(mean_cov = round(mean(Cov)),
              stdev_cov = round(sd(Cov))) %>%
    ungroup() %>%
    left_join(Metadata, by = join_by(sample_id)) %>%
    select(tissue, individual_id, sample_id, mean_cov, stdev_cov)

summTable_illumina <- summTable_wgs %>% select(tissue, individual_id, sample_id, mean_cov_MT = mean_cov) %>%
    full_join(summTable_wgs_22 %>% select(tissue, individual_id, sample_id, mean_cov_chr22 = mean_cov),
              by = join_by(tissue, individual_id, sample_id)) %>%
    mutate(relative_copynumber = round(mean_cov_MT/mean_cov_chr22))

bind_rows(summTable_illumina, summTable_ont) %>% filter(tissue == "edta_blood") %>%
    left_join(Metadata %>% select(sample_id, platform), by = join_by(sample_id)) %>%
    select(individual_id, platform, cov_MT = mean_cov_MT, 
           cov_chr22 = mean_cov_chr22, copy_nr = relative_copynumber) %>%
    pivot_wider(names_from = platform, values_from = c(cov_MT, cov_chr22, copy_nr)) %>%
    kable(caption = "Relative mtDNA copy number in EDTA blood samples (ONT and Illumina)") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Relative mtDNA copy number in EDTA blood samples (ONT and Illumina)
individual_id cov_MT_illumina cov_MT_ont cov_chr22_illumina cov_chr22_ont copy_nr_illumina copy_nr_ont
c 6325 436 46 39 138 11
d 6433 452 45 40 143 11

These results indicate a stark contrast between the two sequencing technologies in their relative coverages of the nuclear and mitochondrial genomes. According to the typically observed values (around 100, source), it seems that Illumina is more accurate and ONT is biased towards nuclear DNA.

Mutational load and error

Due to a higher error rate in long-read technology (at least compared to Illumina NGS sequencing technology), it is not possible to confidently estimate mutational loads. However, it is still possible to study relative mutational loads across tissues.

Sequencing error

First, we can compare the error rate of ONT and Illumina sequencing platfoms by leveraging on the EDTA blood samples from the same 2 individuals. The sequencing error rate is here estimated using the chr22:20,000,000-20,100,000 region, which should only exhibit diploid genotypes. Only positions with a coverage within two standard deviations of the mean coverage are considered. We then calculate the number of erroneous calls as the bases that are different from the genotype.

The following plots represent this error along the genomic reference studied for the four samples (top samples, ONT, bottom samples, Illumina). Missing regions correspond to low coverage. The discrete nature of the coverage results in the “steps” seen in the plots.

In the case of the Illumina samples, some small regions seem to be problematic for both samples, suggesting that sequence composition is playing a role.

perNucFiles <- paste0("../results/depth/", 
                      filter(Metadata, tissue == "edta_blood")$sample_id, 
                      "_per_nuc_depth_chr22.txt")
names(perNucFiles) <- filter(Metadata, tissue == "edta_blood")$sample_id
#all(file.exists(perNucFiles))


perNucCov <- lapply(names(perNucFiles), function(sid) {
    read_tsv(perNucFiles[sid], col_select = c(1,2,4,6,8,10),
             c("seq", "pos", "a", "A", "c", "C", "g", "G", "t", "T")) %>%
        mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .)
perNucCov_filt <- perNucCov %>% mutate(Cov = A + C + G + T) %>%
    group_by(sample_id) %>%
    mutate(mean_cov = mean(Cov), std_cov_2 = 2*sd(Cov)) %>%
    mutate(keep = ifelse(Cov > (mean_cov - std_cov_2) & Cov < (mean_cov + std_cov_2), TRUE, FALSE)) %>%
    filter(keep, Cov > 0) %>%
    select(-mean_cov, -std_cov_2, -keep) %>%
    ungroup()

errors_per_base <- perNucCov_filt %>% 
    pivot_longer(A:T, names_to = "base", values_to = "count") %>%
    arrange(sample_id, pos, desc(count)) %>%
    group_by(sample_id, pos) %>%
    mutate(hom_GT = base[1], het_GT = ifelse(count[2] > 0, base[2], NA)) %>%
    mutate(hom_right = count[1]) %>%
    mutate(het_right = ifelse(!is.na(het_GT), min(Cov[1]/2,count[1])+min(Cov[1]/2,count[2]), NA)) %>%
    select(sample_id, pos, Cov, hom_right, het_right) %>%
    distinct() %>%
    mutate(right = pmax(hom_right, het_right, na.rm = TRUE)) %>%
    mutate(wrong = Cov - right) %>%
    select(-hom_right, -het_right)

Plot3 <- errors_per_base %>%
    ggplot(aes(x = pos, y = wrong/Cov)) +
    geom_point(size = 1) +
    facet_grid(sample_id ~ .)

Plot3

invisible(png("./Figs/GT_dispersion.png", height = 2000, width = 6000, res = 150))
Plot3
invisible(dev.off())

The following table shows the error rate calculated based on this region.

errors_per_base %>% group_by(sample_id) %>%
    summarise(error_rate = formatC(sum(wrong)/sum(Cov), digits = 2)) %>%
    left_join(Metadata %>% select(sample_id, platform)) %>%
    kable(caption = "Error rate in chr22:20,000,000-20,100,000") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Error rate in chr22:20,000,000-20,100,000
sample_id error_rate platform
6226-CT-0016 0.052 ont
6226-CT-0017 0.054 ont
SL443862 0.0035 illumina
SL443863 0.0037 illumina

Mutational load

Due to the high error rate of the ONT technology, it is not possible to calculate an absolute mutational load in the mtDNA at the single molecule level. However, it is still possible to assess differences in mutational load across tissues.

To estimate the mutational load M, we calculate the proportion of basecalls per position that are different from the genotype at the position. Then, we logit-transform these values (which would be ranging from 0 to 1) so that they are normally distributed M = log2(e/(1-e)).

perNucFiles2 <- paste0("../results/depth/", 
                       filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id, 
                       "_per_nuc_depth_chrM.txt")
names(perNucFiles2) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(perNucFiles2))


perNucCov2 <- lapply(names(perNucFiles2), function(sid) {
    read_tsv(perNucFiles2[sid], col_select = c(1,2,4,6,8,10),
             c("seq", "pos", "a", "A", "c", "C", "g", "G", "t", "T")) %>%
        mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
    mutate(Cov = A + C + G + T)

mut_load <- perNucCov2 %>% 
    pivot_longer(A:T, names_to = "base", values_to = "count") %>%
    arrange(sample_id, pos, desc(count)) %>%
    group_by(sample_id, pos) %>%
    mutate(hom_GT = base[1]) %>%
    mutate(right = count[1]) %>%
    select(sample_id, pos, Cov, right) %>%
    distinct() %>%
    filter(pos != 3107) %>%
    left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id)) %>%
    mutate(mut_load = (Cov - right)/Cov)
mut_load$mut_load <- transf.betareg(mut_load$mut_load)
mut_load$M <- log2(mut_load$mut_load / (1 - mut_load$mut_load))

Plot4_dist <- mut_load %>% ggplot(aes(x = M)) + 
    geom_density(aes(colour = tissue)) +
    ggtitle("mtDNA non-GT basecall proportion") +
    xlim(-10, 0)

#Plot4 <- mut_load %>%
#    ggplot(aes(x = pos, y = M)) +
#    geom_point() +
#    facet_grid(sample_id ~ .)

Plot4_dist

#Plot4

invisible(png("./Figs/MT_M_dist.png", height = 1000, width = 1500, res = 150))
Plot4_dist
invisible(dev.off())

#invisible(png("./Figs/MT_M_load.png", height = 2000, width = 6000, res = 150))
#Plot4
#invisible(dev.off())

However, this is just an average mutational load in the entire mtDNA population. We can now focus on the distribution of “errors” per read, i.e., the proportion of mismatches:matches in the read alignment. We use a logit transformation of the proportion of mismatches in the read:

M = logit(mismatches/(mismatches+matches))

snvsPerReadFiles <- paste0("../results/snvs_per_read/", 
                           filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id,
                           ".tsv")
names(snvsPerReadFiles) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(snvsPerReadFiles))

snvsPerRead <- lapply(names(snvsPerReadFiles), function(sid) {
                          #message(sid)
    read_tsv(snvsPerReadFiles[sid], col_types = "cii") %>%
    group_by(read_id) %>%
    slice_max(matches, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
    left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id)) %>%
    mutate(mut_load = mismatches / (matches + mismatches))

snvsPerRead$mut_load <- transf.betareg(snvsPerRead$mut_load)
snvsPerRead$M <- log2(snvsPerRead$mut_load / (1 - snvsPerRead$mut_load))

Plot5 <- snvsPerRead %>% filter(M > -10) %>%
    ggplot() +
    geom_density(aes(x = M, colour = tissue)) +
    geom_vline(xintercept = -3.5, linetype = 2) +
    ggtitle("mtDNA non-reference basecall proportion PER READ") +
    xlim(-10, 0)
#Plot6 <- snvsPerRead %>% filter(M > -10) %>% 
#    ggplot() +
#    geom_point(aes(x = M, y = matches + mismatches)) +
#    facet_grid(sample_id~.)

Plot5

invisible(png("./Figs/MT_M_perRead_dist.png", height = 1000, width = 1500, res = 150))
Plot5
invisible(dev.off())

Using M as a proxy for the mutational load in the mtDNA sequence that was the template for the read, we observerve that the mtDNA molecules exhibit a marked bimodal distribution. A population of mtDNA molecules harbours more mutations than the prevailing mtDNA population. The threshold seems to be around 8% error rate (M = -3.5). See dashed vertical line above.

Strikingly, the relative proportions of these mtDNA species are highly tissue-specific.

snvsPerRead %>%
    mutate(high_mut_load = ifelse(M > -2.8, TRUE, FALSE)) %>%
    group_by(sample_id, tissue) %>%
    summarise(N = n(), nb_high_mut = sum(high_mut_load), prop_high_mut = nb_high_mut / N, .groups = "drop") %>%
    mutate(prop_high_mut = formatC(prop_high_mut, digits = 2)) %>%
    kable(caption = "Relative proportions of high-mutant mtDNA reads") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Relative proportions of high-mutant mtDNA reads
sample_id tissue N nb_high_mut prop_high_mut
6226-CT-0010 PFC 24612 2305 0.094
6226-CT-0011 heart 653011 3169 0.0049
6226-CT-0012 muscle 55895 2754 0.049
6226-CT-0013 liver 212363 1624 0.0076
6226-CT-0014 kidney 60584 3626 0.06
6226-CT-0015 colon 15712 1210 0.077
6226-CT-0018 cerebellum 14415 3122 0.22

These highly-mutated mtDNA sequences tend to be smaller than the prevailing mtDNA population. The following table tests for each sample (tissue) the sizes of the two groups of mtDNA molecules, those with more than 12% mutational load and those with less.

gc_mt <- left_join(snvsPerRead,
lapply(unique(snvsPerRead$sample_id), function(sid) {
    #message(sid)
    read_tsv(paste0("/data1/romain/ont/results/gc_content/", sid, "_chrM.tsv"), col_types = "cd") %>%
    group_by(read_id) %>%
    slice_head(n = 1) %>%
    ungroup() %>%
    mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .), by = join_by(read_id, sample_id))

Data <- gc_mt %>%
    mutate(high_mut_load = as.integer(ifelse(M > -2.8, 1, 0))) %>%
    mutate(len = matches + mismatches)

lapply(unique(snvsPerRead$sample_id), function(sid) {
    glm(high_mut_load ~ len, data = filter(Data, sample_id == sid) %>% mutate(len = scale2(len)), family = "binomial") %>%
        broom::tidy() %>% filter(term != "(Intercept)") %>%
        mutate(sample_id = sid) %>%
        select(sample_id, everything())
}) %>% Reduce("bind_rows", .) %>%
    select(sample_id, term, estimate, statistic, p.value) %>%
    mutate(signif = stars.pvalue(p.value)) %>%
    mutate(estimate = formatC(estimate, digits = 2), statistic = formatC(statistic, digits = 2), p.value = formatC(p.value, digits = 2)) %>%
    kable(caption = "Association between read length and highly-mutated mtDNA") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Association between read length and highly-mutated mtDNA
sample_id term estimate statistic p.value signif
6226-CT-0010 len -1.2 -30 2e-192 ***
6226-CT-0011 len -2.3 -47 0 ***
6226-CT-0012 len -1.7 -48 0 ***
6226-CT-0013 len -0.081 -3 0.0027 **
6226-CT-0014 len -0.8 -31 3.7e-211 ***
6226-CT-0015 len -1.3 -25 7.1e-143 ***
6226-CT-0018 len -0.98 -31 1.4e-215 ***
Plot7 <- Data %>% ggplot(aes(x = as.factor(high_mut_load), y = len)) +
    #geom_violin() +
    geom_boxplot(outlier.shape = NA) +
    facet_grid(~tissue)

Plot7

invisible(png("./Figs/MT_mutator_species.png", height = 1000, width = 1500, res = 150))
Plot7
invisible(dev.off())

It is unlikely that this is a sequencing artifact, since different tissues present different proportions of these “high mutator” mtDNA species. To confirm this, we can test whether chr22 also has a similar bimodal distribution.

snvsPerReadFiles_chr22 <- paste0("../results/snvs_per_read/", 
                                 filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id, 
                                 "_chr22.tsv")
names(snvsPerReadFiles_chr22) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(snvsPerReadFiles_chr22))

snvsPerRead_chr22 <- lapply(names(snvsPerReadFiles_chr22), function(sid) {
                          #message(sid)
    read_tsv(snvsPerReadFiles_chr22[sid], col_types = "cii") %>%
    mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
    mutate(mut_load = mismatches / (matches + mismatches)) %>%
    left_join(select(Metadata, sample_id, tissue), by = join_by(sample_id))

snvsPerRead_chr22$mut_load <- transf.betareg(snvsPerRead_chr22$mut_load)
snvsPerRead_chr22$M <- log2(snvsPerRead_chr22$mut_load / (1 - snvsPerRead_chr22$mut_load))

Plot8 <- bind_rows(snvsPerRead %>% mutate(seqnames = "mtDNA"),
                   snvsPerRead_chr22 %>% mutate(seqnames = "chr22") %>%
                       select(read_id, matches:sample_id, tissue, mut_load, M, seqnames)) %>%
    filter(M > -10) %>%
    ggplot() +
    geom_density(aes(x = M, colour = tissue)) +
    facet_grid(seqnames~.) +
    geom_vline(xintercept = -3.5, linetype = 2) +
    ggtitle("mtDNA non-reference basecall proportion PER READ")

Plot8

invisible(png("./Figs/chr22_M_perRead_dist.png", height = 1800, width = 1500, res = 150))
Plot8
invisible(dev.off())

We also test now whether the GC content is different between these different mtDNA populations.

lapply(unique(snvsPerRead$sample_id), function(sid) {
    glm(high_mut_load ~ gc_proportion, data = filter(Data, sample_id == sid) %>% 
                                          mutate(gc_proportion = scale2(gc_proportion)),
            family = "binomial") %>%
        broom::tidy() %>% filter(term != "(Intercept)") %>%
        mutate(sample_id = sid) %>%
        select(sample_id, everything())
}) %>% Reduce("bind_rows", .) %>%
    select(sample_id, term, estimate, statistic, p.value) %>%
    mutate(signif = stars.pvalue(p.value)) %>%
    mutate(estimate = formatC(estimate, digits = 2), statistic = formatC(statistic, digits = 2), p.value = formatC(p.value, digits = 2)) %>%
    kable(caption = "Association between GC proportion and highly-mutated mtDNA") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Association between GC proportion and highly-mutated mtDNA
sample_id term estimate statistic p.value signif
6226-CT-0010 gc_proportion -1.2 -57 0 ***
6226-CT-0011 gc_proportion -0.94 -1.1e+02 0 ***
6226-CT-0012 gc_proportion -1.2 -75 0 ***
6226-CT-0013 gc_proportion -0.94 -45 0 ***
6226-CT-0014 gc_proportion -1.1 -74 0 ***
6226-CT-0015 gc_proportion -1.3 -43 0 ***
6226-CT-0018 gc_proportion -1.3 -51 0 ***
Plot9 <- Data %>% ggplot(aes(x = as.factor(high_mut_load), y = gc_proportion)) +
    geom_boxplot(outlier.shape = NA) +
    ylim(0.25, 0.55) +
    facet_grid(~tissue)

Plot9

invisible(png("./Figs/MT_mutator_species_GC_content.png", height = 1000, width = 1500, res = 150))
Plot9
invisible(dev.off())

In order to characterize better this population of highly-mutated mtDNA sequences, we run a generalized linear model where the mtDNA class (highly mutated, 1 or not highly mutated, 0) is predicted simultaneously by the GC proportion (as before), average sequencing quality of the read, number of deletions in the aligned read (normalised by its length), and proportion of the aligned read made up of deletion (i.e., the total sum of the deletion lengths in the alignment divided by the read length).

highly mutated mtDNA ~ GC_proportion + Avg_read_qual + n_deletions + prop_del_bases

(All covariates are standardized to render coefficients comparable).

delFiles1 <- paste0("../results/deletions/",
                    filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id, 
                    "_chrM.txt")
names(delFiles1) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(delFiles1))
delFiles2 <- paste0("../results/deletions/",
                    filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id, 
                    "_chrM_shifted.txt")
names(delFiles2) <- filter(Metadata, platform == "ont", individual_id %in% c("a","b"))$sample_id
#all(file.exists(delFiles2))

Del1 <- lapply(names(delFiles1), function(sid) {
    #message(sid)
    read_tsv(delFiles1[sid], col_types = "ciiid") %>%
    mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
    filter(del_end - del_start > 1)

Del2 <- lapply(names(delFiles2), function(sid) {
    #message(sid)
    read_tsv(delFiles2[sid], col_types = "ciiid") %>%
    mutate(sample_id = sid)
}) %>% Reduce("bind_rows", .) %>%
    filter(del_end - del_start > 1)

Del2$del_start <- sapply(Del2$del_start, us8k)
Del2$del_end <- sapply(Del2$del_end, us8k)

Del <- bind_rows(Del1, Del2) %>%
    group_by(sample_id, read_id, del_start, del_end, read_length, avg_qual) %>%
    summarise(support = n(), .groups = "drop")
rm(Del1, Del2)
invisible(gc())
nb_dels <- Del %>%
    mutate(del_size = ifelse(del_start <= del_end, 
                             del_end - del_start, 
                             del_end + (16569 - del_start))) %>%
    group_by(sample_id, read_id, read_length, avg_qual) %>%
    summarise(n_dels = n(), n_deleted_bases = sum(del_size), 
              median_del_size = median(del_size),
              .groups = "drop") %>%
    mutate(n_dels_per_base = n_dels / read_length,
           n_deleted_bases_per_base = n_deleted_bases / read_length) %>%
    group_by(sample_id, read_id) %>%
    slice_max(read_length, n = 1) %>%
    ungroup()

Data <- inner_join(Data %>% select(sample_id, tissue, read_id = read_id, everything()), nb_dels, by = join_by(sample_id, read_id))

Tests <- lapply(unique(snvsPerRead$sample_id), function(sid) {
    glm(high_mut_load ~ gc_proportion + avg_qual + n_dels_per_base + n_deleted_bases_per_base + len,
        data = filter(Data, sample_id == sid) %>%
            mutate(gc_proportion = scale2(gc_proportion), 
                   avg_qual = scale2(avg_qual), 
                   n_dels_per_base = scale2(n_dels_per_base), 
                   len = scale2(len),
                   n_deleted_bases_per_base = scale2(n_deleted_bases_per_base)),
        family = "binomial") %>%
    broom::tidy() %>% filter(term != "(Intercept)") %>%
    mutate(sample_id = sid) %>%
    select(sample_id, everything())
})

for (Test in Tests) {
    Test %>% select(sample_id, term, estimate, statistic, p.value) %>%
        mutate(signif = stars.pvalue(p.value)) %>%
        mutate(estimate = formatC(estimate, digits = 3), 
               statistic = formatC(statistic, digits = 3), 
               p.value = formatC(p.value, digits = 3)) %>%
        kable(caption = paste0(Test$sample_id[1], " (", Metadata[Metadata$sample_id == Test$sample_id[1],]$tissue[1], ")")) %>%
        kable_styling(bootstrap_options = "striped", full_width = FALSE) %>%
        print()
    cat("\n")
}
6226-CT-0010 (PFC)
sample_id term estimate statistic p.value signif
6226-CT-0010 gc_proportion -1.09 -43.8 0 ***
6226-CT-0010 avg_qual 0.0315 1.13 0.259
6226-CT-0010 n_dels_per_base -0.143 -4.47 7.95e-06 ***
6226-CT-0010 n_deleted_bases_per_base 0.0109 0.327 0.743
6226-CT-0010 len -0.492 -11.9 8.82e-33 ***
6226-CT-0011 (heart)
sample_id term estimate statistic p.value signif
6226-CT-0011 gc_proportion -0.925 -85.1 0 ***
6226-CT-0011 avg_qual 0.491 23 8.45e-117 ***
6226-CT-0011 n_dels_per_base 0.557 34.8 2.54e-265 ***
6226-CT-0011 n_deleted_bases_per_base 0.0359 3.25 0.00117 **
6226-CT-0011 len -1.2 -26.6 1.92e-156 ***
6226-CT-0012 (muscle)
sample_id term estimate statistic p.value signif
6226-CT-0012 gc_proportion -1.09 -60.1 0 ***
6226-CT-0012 avg_qual 0.241 8.87 7.62e-19 ***
6226-CT-0012 n_dels_per_base 0.326 13.7 8.35e-43 ***
6226-CT-0012 n_deleted_bases_per_base 0.00639 0.184 0.854
6226-CT-0012 len -1.19 -29.3 3.58e-188 ***
6226-CT-0013 (liver)
sample_id term estimate statistic p.value signif
6226-CT-0013 gc_proportion -1.28 -49.9 0 ***
6226-CT-0013 avg_qual 0.488 16.8 2.1e-63 ***
6226-CT-0013 n_dels_per_base 0.765 32 1.44e-224 ***
6226-CT-0013 n_deleted_bases_per_base 0.0168 1.49 0.135
6226-CT-0013 len 0.0283 0.913 0.361
6226-CT-0014 (kidney)
sample_id term estimate statistic p.value signif
6226-CT-0014 gc_proportion -1.23 -68.8 0 ***
6226-CT-0014 avg_qual 0.331 15.2 3.7e-52 ***
6226-CT-0014 n_dels_per_base 0.377 16.7 1.92e-62 ***
6226-CT-0014 n_deleted_bases_per_base 0.0573 3.51 0.000452 ***
6226-CT-0014 len -0.383 -13.5 2.21e-41 ***
6226-CT-0015 (colon)
sample_id term estimate statistic p.value signif
6226-CT-0015 gc_proportion -1.28 -38.6 0 ***
6226-CT-0015 avg_qual 0.355 8.58 9.69e-18 ***
6226-CT-0015 n_dels_per_base 0.358 6.86 6.89e-12 ***
6226-CT-0015 n_deleted_bases_per_base 0.16 3.66 0.000251 ***
6226-CT-0015 len -0.91 -14.7 1.24e-48 ***
6226-CT-0018 (cerebellum)
sample_id term estimate statistic p.value signif
6226-CT-0018 gc_proportion -1.27 -44.7 0 ***
6226-CT-0018 avg_qual 0.167 6 2.02e-09 ***
6226-CT-0018 n_dels_per_base 0.114 3.78 0.000158 ***
6226-CT-0018 n_deleted_bases_per_base 0.0385 1.82 0.0692 .
6226-CT-0018 len -0.636 -16.4 3.83e-60 ***

These results indicate that these highly-mutated mtDNA sequences have significantly lower GC content than the main mtDNA population in all samples. Interestingly, the average sequencing quality is higher in the highly-mutated mtDNA sequences, supporting the notion that these are not just sequencing artifacts. This is observed in all samples with the exception of the PFC, where this was not significant. The number of deletion events was also significantly associated with the high mutational load in every sample, although without a consistent direction of association. While in most samples the number of deletions was hight in highly-mutated mtDNA reads, the two brain tissues (cerebellum and PFC) exhibited the opposing trend, i.e., a decreased number od deletion events in highly-mutated mtDNA reads. The size of the mutation events had a smaller association with the highly-mutated status, and was only statistically significant in kidney and colon, where it was positively associated. Finally, the read size was negatively associated with highly-mutated mtDNA, i.e., longer sequences tend to have lower mutation frequencies (with the exception of liver, where the association was not statistically significant).

## varFiles <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_combined_filt_splt.vcf")
## names(varFiles) <- filter(Metadata, platform == "ont")$sample_id
## 
## #varFiles1 <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_variants.txt")
## #names(varFiles1) <- filter(Metadata, platform == "ont")$sample_id
## 
## #varFiles2 <- paste0("../results/", filter(Metadata, platform == "ont")$sample_id, "_chrM_shifted_variants.txt")
## #names(varFiles2) <- filter(Metadata, platform == "ont")$sample_id
## 
## 
## 
## 
## #########################
## ##### READ VARIANTS #####
## #########################
## 
## require("vcfR")
## 
## # VCF files generated using Mutect2 are already fixed in terms of coordinates
## chrM_vars <- lapply(names(varFiles), function(sid) {
##     File <- varFiles[sid]
##     read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
##         filter(Pos >= 4000, Pos < 12000) %>%
##         filter(Type %in% c(1,2)) %>%
##         mutate(sample_id = sid) %>%
##         select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
## 
## 
## 
## 
## 
## # read chrM pos. 4000-11999 from normal chrM alignments
## chrM_vars <- lapply(names(varFiles1), function(sid) {
##     File <- varFiles1[sid]
##     read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
##         filter(Pos >= 4000, Pos < 12000) %>%
##         filter(Type %in% c(1,2)) %>%
##         mutate(sample_id = sid) %>%
##         select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
## 
## # read chrM pos. 1-3999 and 12000-16569 from alignments to shifted reference,
## # which corresponds to 8570-12568 and 4000-8569
## chrM_shift_vars <- lapply(names(varFiles2), function(sid) {
##     File <- varFiles2[sid]
##     read_tsv(File, col_types = "cciccdcdcdiiiid") %>% filter(Filter == "PASS") %>%
##         filter(Pos >= 4000, Pos < 12569) %>%
##         filter(Type %in% c(1,2)) %>%
##         mutate(sample_id = sid) %>%
##         mutate(Pos = sapply(Pos, us8k)) %>%
##         select(-ID, -Filter)
## }) %>% Reduce("bind_rows", .)
## 
## all_vars <- bind_rows(chrM_vars, chrM_shift_vars) %>%
##     left_join(Metadata, by = join_by(sample_id)) %>%
##     select(tissue, 
## all_vars
## rm(chrM_vars, chrM_shift_vars)
## invisible(gc())
## 
## all_vars %>% filter(tissue %in% c("edta_blood_1", "edta_blood_2"), Type == 1) %>%
##     arrange(Pos)
## all_vars %>% filter(tissue %in% c("edta_blood_1", "edta_blood_2"), Type == 2) %>%
##     filter(Pos == 6)
##     arrange(Pos)

Deletions

Distribution of ONT read lengths

A first rough estimate of the presence and abundance of deletions is the length of the mtDNA-aligned reads.

The density plots show the distribution of ONT read lengths aligned to either the mtDNA (rCRS) or the chr22. The vertical dashed line indicates the rCRS length (16,569).

Comparison of deletions across tissues

We can first formally compare the sizes of the reads across tissues.

#require('rstatix')
#compare_means(readlength ~ tissue, data = Len %>% 
#              left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%    
#              filter(individual_id %in% c("a", "b")))
Tissues <- Metadata %>% filter(individual_id %in% c("a", "b")) %>% pull(tissue)
my_comparisons <- lapply(1:(length(Tissues)-1), function(i) lapply((i+1):length(Tissues), function(j) return(c(Tissues[i],Tissues[j])))) %>%
    unlist(recursive = F)

lenBoxPlot <- Len %>%
    left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
    ggplot(aes(x = tissue, y = log10(readlength))) +
    geom_boxplot(outlier.shape = NA)

lenBoxPlot

Len %>%
    left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
    mutate(log10_readlength = log10(readlength)) %>%
    filter(!is.infinite(log10_readlength)) %>%
    compare_means(formula = log10_readlength ~ tissue, data = ., comparisons = my_comparisons, label = "p.signif") %>%
    kable(caption = "Comparisong read length across tissues") %>% 
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Comparisong read length across tissues
.y. group1 group2 p p.adj p.format p.signif method
log10_readlength PFC heart 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC muscle 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC liver 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC kidney 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC colon 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength PFC cerebellum 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength heart muscle 0.0076086 0.01500 0.0076 ** Wilcoxon
log10_readlength heart liver 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength heart kidney 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength heart colon 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength heart edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength heart cerebellum 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength muscle liver 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength muscle kidney 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength muscle colon 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength muscle edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength muscle cerebellum 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength liver kidney 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength liver colon 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength liver edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength liver cerebellum 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength kidney colon 0.0000448 0.00013 4.5e-05 **** Wilcoxon
log10_readlength kidney edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength kidney cerebellum 0.2383921 0.24000 0.2384 ns Wilcoxon
log10_readlength colon edta_blood 0.0000000 0.00000 < 2e-16 **** Wilcoxon
log10_readlength colon cerebellum 0.0000000 0.00000 1.1e-08 **** Wilcoxon
log10_readlength edta_blood cerebellum 0.0000000 0.00000 < 2e-16 **** Wilcoxon
lm(formula = log10_readlength ~ tissue, data = Len %>%
    left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
    mutate(log10_readlength = log10(readlength)) %>%
    filter(!is.infinite(log10_readlength))) %>%
    broom::tidy() %>%
    kable(caption = "Linear model comparing  read length across tissues") %>% 
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Linear model comparing read length across tissues
term estimate std.error statistic p.value
(Intercept) 3.2682446 0.0020919 1562.35501 0
tissuecolon 0.0424773 0.0033535 12.66661 0
tissueedta_blood -0.1486640 0.0031622 -47.01311 0
tissueheart 0.1554573 0.0021416 72.58932 0
tissuekidney 0.0488258 0.0025056 19.48694 0
tissueliver -0.0800692 0.0022859 -35.02816 0
tissuemuscle 0.1109153 0.0025117 44.15936 0
tissuePFC 0.0573036 0.0028207 20.31568 0

Now we compare the distributions of deletion lengths ( >100) across tissues.

delBoxPlot <- Del %>%
    mutate(del_size = ifelse(del_start <= del_end, 
                         del_end - del_start, 
                         del_end + (16569 - del_start))) %>%
    filter(del_size >= 100) %>%
    left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
    ggplot(aes(x = tissue, y = log10(del_size))) +
    geom_boxplot(outlier.shape = NA)

delBoxPlot

lm(formula = log10_del_size ~ tissue + read_length, data = 
    Del %>%
        mutate(del_size = ifelse(del_start <= del_end, 
                             del_end - del_start, 
                             del_end + (16569 - del_start))) %>%
        filter(del_size >= 100) %>%
        left_join(Metadata %>% select(sample_id, individual_id, tissue), by = join_by(sample_id)) %>%
        mutate(log10_del_size = log10(del_size))) %>%
    broom::tidy() %>%
    kable(caption = "Linear model comparing deletion size across tissues") %>% 
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
Linear model comparing deletion size across tissues
term estimate std.error statistic p.value
(Intercept) 2.2687847 0.0641501 35.3668172 0.0000000
tissuecolon -0.0334776 0.1107483 -0.3022853 0.7624502
tissueheart 0.2290750 0.0617619 3.7090054 0.0002109
tissuekidney 0.0281485 0.0752157 0.3742370 0.7082478
tissueliver 0.1691763 0.0737804 2.2929712 0.0219011
tissuemuscle 0.5949035 0.0674216 8.8236385 0.0000000
tissuePFC 0.6128355 0.0717466 8.5416697 0.0000000
read_length 0.0000125 0.0000023 5.4242017 0.0000001

Deletion breakpoints

Here we plot the deletions larger than 100bp in the mtDNA, based on single reads from the ONT sequencing.

The plots are large and are, therefore, exported into svg images in the Figs/ folder. Only the PFC is plotted here as an example.

# mtDNA gene annotation
mt_genes_df <- readRDS("./Data/EnsDb.Hsapiens.v75.Rds") %>%
    as_tibble() %>%
    filter(seqnames == "MT") %>%
    select(start, end, strand, gene_name, gene_biotype) %>%
    unique() %>%
    mutate(seqnames = factor("mtDNA")) %>%
    mutate(gene_name = str_replace(gene_name, "^MT-", "")) %>%
    mutate(drop = ifelse(gene_name %in% c("ATP6", "ND4", "TQ"), TRUE, FALSE)) %>%
    mutate(big = ifelse(gene_name %in% c("RNR1", "RNR2", "ND1", "ND2", 
                                         "CO1", "CO2", "ATP8", "ATP6", 
                                         "CO3", "ND3", "ND4L", "ND4", 
                                         "ND5", "CYB"), TRUE, FALSE)) %>%
    select(seqnames, everything())
#
#tibble(seqnames = "chrM", start = seq(1,16001, by = 100),
#       end = c(seq(100, 16000, by = 100), 16569)) %>%
#write_tsv("./Data/ranges.bed", col_names = FALSE)
#
##   bedtools nuc -fi /ref/MT/Homo_sapiens_assembly38.chrM.fasta \
##   -bed analyses/Data/ranges.bed | \
##   cut -f1-3,5 > analyses/Data/rCRS_GC_content.tsv
#

# GC content of the mtDNA sequence in 100bp windows
mt_gc_content <- read_tsv("./Data/rCRS_GC_content.tsv", 
                          col_names = c("seqnames", "start", "end", "GC_prop"),
                          col_types = "ciid",
                          comment = "#") %>%
    mutate(seqnames = factor("mtDNA"))
for (SID in unique(Del$sample_id)) {
    invisible(svg(paste0("./Figs/circos_", SID, ".svg"), width = 10, height = 10))
    circ_arcs(sid = SID, min_del_len = 100, alpha = 0.5)
    invisible(dev.off())
    invisible(svg(paste0("./Figs/del_cov_circos_", SID, ".svg"), width = 10, height = 10))
    circ_del_cov(sid = SID, min_del_len = 100, alpha = 0.5)
    invisible(dev.off())
}
circ_del_cov(sid = "6226-CT-0010", min_del_len = 100, alpha = 0.5)